Abstract. We perform a finite sample analysis of the detection levels
for sparse principal components of a high-dimensional covariance matrix.
Our minimax optimal test is based on a sparse eigenvalue statistic. Alas,
computing this test is known to be NP-complete in general and we describe
a computationally efficient alternative test using convex relaxations. Our
relaxation is also proved to detect sparse principal components at near
optimal detection levels and performs very well on simulated datasets.
1. INTRODUCTION

The sparsity assumption has become preponderant in modern, high-dimensional
statistics. In the high dimension, low sample size setting, where consistency seems
to be hopeless, sparsity turns out to be the statistician's salvation. It formalizes
the a priori belief that only a few parameters, among a large number of
them, are significant for the statistical task at hand. This paper explores a
specific high-dimensional problem, namely Principal Component Analysis (PCA).
Indeed, classical PCA is known to produce inconsistent estimators of the direc- 
tions that explain the most variance (Johnstone and Lu, 2009; Nadler, 2008; Paul, 
2007) without further assumption. For PCA, the spiked covariance model intro- 
duced by Johnstone (2001) directly encodes the sparsity assumption. Namely, this 
model relies on the assumption that there exists a few sparse directions that ex- 
plain most of the variance. Formally, we assume that the observations are drawn 
from a multivariate Gaussian distribution with mean zero and covariance matrix 
given by $I + \theta vv^\top$, where $I$ is the identity matrix and $v$ is a unit norm sparse
vector. Akin to other models, the sparsity assumption drives both methods and 
analysis in a wide variety of applications ranging from signal processing to biol- 
ogy (see, e.g., Alon et al., 1999; Chen, 2011; Jenatton et al., 2009; Wright et al., 
2011, for a few examples). Most contributions to this problem have focused on 
consistent estimation of the sparse principal component v for various performance 
measures (see, e.g., Amini and Wainwright, 2009; Ma, 2011b; Shen et al., 2009,
and the above references). 

What if there is no sparse component? In other words, what if $\theta = 0$? From
a detection standpoint, one may ask the following question: How much variance 
should a sparse principal component explain in order to be detectable by a statis- 
tical procedure? Answering this question consists in (i) constructing a test that 
can detect this sparse principal component when the associated variance is above 
a certain level and (ii) proving that no test can detect such a principal component 
below a certain level. 

Optimal detection levels in a high-dimensional setup have recently received 
a lot of attention. More precisely, Arias-Castro et al. (2010); Donoho and Jin 
(2004); Ingster et al. (2010) have studied the detection of a sparse vector cor- 
rupted by noise under various sparsity assumptions. More recently, this prob- 
lem has been extended from vectors to matrices by Butucea and Ingster (2011); 
Sun and Nobel (2008, 2010) who propose to detect a shifted sub-matrix hidden
in a Gaussian or binary random matrix. While the notion of sub-matrix encodes 
a certain sparsity structure, these two papers focus on the elementwise proper- 
ties of random matrices, unlike the blooming random matrix theory that focuses 
on spectral aspects. Arias-Castro et al. (2011) studied a problem somewhere be- 
tween sparse PCA detection and the shifted sub-matrix problem. Their goal is 
to detect a shifted off-diagonal sub-matrix hidden in a covariance matrix. Their 
methods are not spectral either. 

We extend the current work on detection in two directions. First, we analyze 
detection in the framework of sparse PCA, and more precisely, in the spiked 
covariance model. Second, while all the literature on this topic is presented in an 
asymptotic framework, we propose a finite sample analysis of our problem with 
results that hold with high-probability. These results show the delicate interplay 
between the important parameters of the problem: the ambient dimension, the 
sample size and the sparsity. 

Note that the spiked covariance model is particularly amenable to spectral 
methods due to its rotational invariance. It turns out that the so-called $k$-sparse
largest eigenvalue can be used to construct an optimal test. However, constructing
this test raises some technical difficulties and can even be proved to be
NP-complete. As a result, a large body of the optimization literature on this topic
consists in numerical methods to overcome this issue (see, e.g., d'Aspremont et al.,
2008, 2007; Journée et al., 2010; Lu and Zhang, 2011; Ma, 2011a, and references
therein). Nevertheless, while these methods do produce a solution, their statis- 
tical properties are rarely addressed for the estimation problem and never for 
the detection problem. One of the approaches introduced by d'Aspremont et al. 
(2007) uses a convexification technique called semidefinite programming (SDP). 
A major drawback of this technique is that it may not output a vector v but a 
matrix and an ad hoc post-processing step is often required to turn this matrix 
back into a vector. However, in the context of detection our goal is not to estimate
the eigenvector v but rather its associated eigenvalue. This notable difference al- 
lows us to even bypass SDP optimization, which is known to scale poorly in very 
high dimension. Inspired by the SDP formulation, we propose a simple test pro- 
cedure based on the minimum dual perturbation (MDP) that is easy to compute 
and for which we can derive near optimal performance bounds for the detection 
problem.

Most of our analysis is performed in the spiked covariance model for Gaus- 
sian random vectors. Nevertheless, our results are robust to variations around 
this model and we spend Section 7 discussing various weaker assumptions under 
which our results still hold. In particular, we only need that our estimated co- 
variance matrix belongs to a small box around the true covariance matrix with 
high probability. This setup encompasses biased estimators or adversarial noise. 

The rest of the paper is organized as follows. In Section 2, we introduce the de- 
tection problem for sparse principal components. In Section 3, we discuss various 
links with classical and more recent results on random matrix theory and more 
precisely, the asymptotic effect of low rank perturbations to Wishart matrices. 
Our main results are contained in Section 4, where in particular, we introduce a 
new test based on spectral methods and derive the level at which it achieves de- 
tection of sparse principal components with high probability. This level is proved 
to be optimal in a minimax sense in Section 5. Unfortunately, this test cannot be 
computed efficiently and several relaxations are proposed in Section 6. As men- 
tioned above, Section 7 discusses various weaker assumptions under which our 
results hold. Finally the performance of our test is illustrated on several numerical 
examples in Section 8. 

Notations. The space of $d \times d$ symmetric real matrices is denoted by $\mathbf{S}_d$, and
the cone of semidefinite positive matrices is denoted by $\mathbf{S}_d^+$. We write equivalently
$Z \in \mathbf{S}_d^+$ and $Z \succeq 0$.

For any $q \ge 1$ we denote by $|v|_q$ the $\ell_q$ norm of a vector $v$ and by extension,
we denote by $|v|_0$ its so-called "$\ell_0$ norm", that is its number of nonzero elements.
The elements of a vector $v \in \mathbb{R}^p$ are denoted by $v_1, \ldots, v_p$ and similarly, a matrix
$Z$ has element $Z_{ij}$ on its $i$th row and $j$th column. Furthermore, by extension, for
$Z \in \mathbf{S}_d$, we denote by $|Z|_q$ the $\ell_q$ norm of the vector formed by the entries of $Z$.

The trace and rank functionals are denoted by Tr and rank respectively and
have their usual definition. The identity matrix in $\mathbb{R}^p$ is denoted by $I_p$. For a
finite set $S$, we denote by $|S|$ its cardinality. We also write $A_S$ for the $|S| \times |S|$
submatrix with elements $(A_{ij})_{i,j \in S}$ and $v_S$ for the vector of $\mathbb{R}^{|S|}$ with elements
$v_i$ for $i \in S$. Finally, for two real numbers $a$ and $b$, we write $a \wedge b = \min(a, b)$,
$a \vee b = \max(a, b)$, and $a_+ = a \vee 0$.

2. SPHERICITY TEST WITH SPARSE ALTERNATIVE 

Let $X_1, \ldots, X_n$ be $n$ i.i.d. realizations of a random variable $X$ in $\mathbb{R}^p$. Our
objective is to test the sphericity hypothesis, i.e., that the distribution of $X$ is
invariant by rotation in $\mathbb{R}^p$. For a Gaussian distribution, this is equivalent to
testing if the covariance matrix of $X$ is of the form $\sigma^2 I_p$ for some known $\sigma^2 > 0$.

Without loss of generality we may assume $\sigma^2 = 1$ so that the covariance matrix
is the identity in $\mathbb{R}^p$ under the null hypothesis. Possible alternative hypotheses
should encompass the idea that there exists a privileged direction, along which $X$
has more variance. Of course, there are many possible characterizations for this
alternative and, in the spirit of sparse PCA, we focus on the case where the
privileged direction is sparse. Therefore, we consider the alternative hypothesis
where the covariance matrix is a sparse rank one perturbation of the identity $I_p$.
Formally, let $v \in \mathbb{R}^p$ be such that $|v|_2 = 1$, $|v|_0 \le k$, and $\theta > 0$. The hypothesis
testing problem studied throughout this paper is:



$$H_0 : X \sim \mathcal{N}(0, I_p)$$
$$H_1 : X \sim \mathcal{N}(0, I_p + \theta vv^\top).$$



Note that the model under $H_1$ is a generalization of the spiked covariance model
since it allows $v$ to be $k$-sparse on the unit Euclidean sphere. In particular, the
statement of $H_1$ is invariant under rotation on the $k$ relevant variables.
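
To make the two hypotheses concrete, the following is a minimal simulation sketch, not part of the original analysis. The helper name `sample_spiked`, the random choice of support and signs, and all numerical values are assumptions made for illustration only. It draws data from $\mathcal{N}(0, I_p + \theta vv^\top)$ by adding an independent Gaussian factor along $v$ to standard Gaussian noise.

```python
import numpy as np

def sample_spiked(n, p, k, theta, rng):
    """Draw n observations from N(0, I_p + theta * v v^T) with a k-sparse unit v.

    Hypothetical helper: the support and signs of v are chosen at random.
    """
    support = rng.choice(p, size=k, replace=False)
    v = np.zeros(p)
    v[support] = rng.choice([-1.0, 1.0], size=k) / np.sqrt(k)  # unit norm, k nonzeros
    # X = Z + sqrt(theta) * g * v has covariance I_p + theta * v v^T
    Z = rng.standard_normal((n, p))
    g = rng.standard_normal((n, 1))
    X = Z + np.sqrt(theta) * g * v
    return X, v

rng = np.random.default_rng(0)
X0, _ = sample_spiked(n=2000, p=20, k=3, theta=0.0, rng=rng)  # null hypothesis
X1, v = sample_spiked(n=2000, p=20, k=3, theta=1.0, rng=rng)  # alternative

# Empirical variance along v: close to 1 under the null, close to 1 + theta otherwise.
var_null = np.mean((X0 @ v) ** 2)
var_alt = np.mean((X1 @ v) ** 2)
print(f"variance along v: null={var_null:.3f}, alternative={var_alt:.3f}")
```

The direction $v$ is of course unknown in practice; the point of the sketch is only that the alternative inflates the variance along one sparse direction by $\theta$.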

Denote by $\Sigma$ the covariance matrix of $X$. A useful statistic in these settings is the
empirical covariance matrix $\hat\Sigma$ defined by

$$\hat\Sigma = \frac{1}{n} \sum_{i=1}^n X_i X_i^\top .$$

It is an unbiased estimator for the covariance matrix of $X$, the maximum likelihood
estimator in the Gaussian case, when the mean is known to be 0. Furthermore,
it is oftentimes the only data provided to the statistician.

We say that a test discriminates between $H_0$ and $H_1$ with probability $1 - \delta$ if
the type I and type II errors both have a probability smaller than $\delta$. Our goal is
therefore to find a statistic $\varphi(\hat\Sigma)$ and levels $\tau_0 < \tau_1$, depending on $(p, n, k, \delta)$, such
that

$$P_{H_0}(\varphi(\hat\Sigma) > \tau_0) \le \delta$$
$$P_{H_1}(\varphi(\hat\Sigma) < \tau_1) \le \delta .$$

Taking $\tau \in [\tau_0, \tau_1]$ gives us control over the type I and type II errors of the test

$$\psi(\hat\Sigma) = \mathbf{1}\{\varphi(\hat\Sigma) > \tau\} ,$$

where $\mathbf{1}\{\cdot\}$ denotes the indicator function. As desired, this test has the property
to discriminate between the hypotheses with probability $1 - \delta$.

3. LINK WITH RANDOM MATRIX THEORY

Note that under the null hypothesis, the sample covariance matrix $\hat\Sigma$ follows a
rescaled Wishart distribution. The spectrum of such matrices has been extensively
studied and is fairly well understood. We give below a quick overview of the results
that are relevant to our problem.

3.1 Spectral methods

It is not hard to see that, under $H_1$, for any $\theta > 0$, $v$ is an eigenvector associated
to the largest eigenvalue of the population covariance matrix $\Sigma$, without
further assumption. Moreover, if $\hat\Sigma$ is close to $\Sigma$ in spectral norm, then its largest
eigenvector should be a good candidate to approximate $v$. It is therefore natural
to consider spectral methods for the spiked covariance model. Understanding the
behavior of our test statistic under both the null and the alternative is key in
proving that it discriminates between the hypotheses.

Convergence of the empirical covariance matrix to the true covariance matrix in
spectral norm has received some attention recently (see, e.g., Bickel and Levina,
2008; Cai et al., 2010; El-Karoui, 2008) under various elementwise sparsity
assumptions and using thresholding methods. However, since our assumption allows
for relevant variables to produce arbitrarily small entries under the alternative
hypothesis, we cannot use such results. A natural statistic to discriminate between
null and alternative would be, for example, using the largest eigenvalue of the
covariance matrix.

Spectral properties of random matrices have received a lot of attention from 
both a statistical and probabilistic angle. We devote the rest of this section to 
review some of the classical results from random matrix theory to argue that even 
in moderate dimensions, the largest eigenvalue cannot discriminate between the 
null and alternative hypotheses. 

It is easy to notice that for any unit vector $v$

$$\lambda_{\max}(I_p) = 1 \quad \text{and} \quad \lambda_{\max}(I_p + \theta vv^\top) = 1 + \theta .$$

If we could allow, for a fixed $p$, to let $n$ go to infinity, the consistency of
the estimator $\hat\Sigma$ (for fixed $p$, entry by entry) and the continuity of the largest
eigenvalue as a function of its entries would prove that we have an efficient method
to discriminate between the two hypotheses, at least asymptotically.

However, in a high dimension setting, where $p$ may grow with $n$, the behavior
of $\lambda_{\max}(\hat\Sigma)$ under the null hypothesis is different. If $p/n \to \alpha > 0$, Geman (1980)
showed that, in accordance with the Marcenko-Pastur distribution, we have

$$\lambda_{\max}(\hat\Sigma) \to (1 + \sqrt{\alpha})^2 ,$$



where the convergence holds almost surely (see also Bai, 1999; Johnstone, 2001,
and references therein). Moreover, Yin et al. (1988) established that $\mathbb{E}(X) = 0$
and $\mathbb{E}(X^4) < \infty$ is a necessary and sufficient condition for this almost sure
convergence to hold. Furthermore, as $\hat\Sigma \in \mathbf{S}_p^+$, its number of positive eigenvalues
is equal to its rank (which is smaller than $n$), and we have



As the sum of np squared norms of independent standard gaussian vectors, 



der the null hypothesis. 

These two results indicate that the largest eigenvalue will not be able to
discriminate between the two hypotheses unless $\theta > Cp/n$ for some positive
constant $C$. In a "large p/small n" scenario, this corresponds to a very strong signal
indeed. In the next subsection, we show that such results can be made more 
formal using perturbation theory. 
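
The behavior described above can be observed in a small simulation. The sketch below is illustrative only: the sizes `n`, `p`, the spike strength `theta` and the choice of a 5-sparse direction are assumptions, not values taken from the paper. It compares the largest eigenvalue of the empirical covariance matrix under the null and under a weak spike with the limit $(1 + \sqrt{\alpha})^2$.

```python
import numpy as np

rng = np.random.default_rng(1)
n, p = 400, 200                      # illustrative sizes, alpha = p / n = 0.5
alpha = p / n
edge = (1 + np.sqrt(alpha)) ** 2     # limit of the largest eigenvalue under the null

def largest_eig(X):
    # Empirical covariance for mean-zero data, then its top eigenvalue.
    S = X.T @ X / X.shape[0]
    return np.linalg.eigvalsh(S)[-1]

v = np.zeros(p)
v[:5] = 1 / np.sqrt(5)               # a 5-sparse unit direction (arbitrary choice)
theta = 0.3                          # weak spike (illustrative value)

lam_null = largest_eig(rng.standard_normal((n, p)))
X_alt = rng.standard_normal((n, p)) + np.sqrt(theta) * rng.standard_normal((n, 1)) * v
lam_alt = largest_eig(X_alt)

print(f"edge={edge:.3f}  null={lam_null:.3f}  alternative={lam_alt:.3f}")
```

In runs of this kind both values sit near the same edge, which is the sense in which the largest eigenvalue alone carries little information about a weak sparse spike.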

3.2 Low rank perturbations of Wishart matrices 

In a slightly different setting, Baik et al. (2005) established that when adding
a finite rank perturbation to a Wishart matrix, a phase transition arises already
in the moderate dimensional regime where $p/n \to \alpha \in (0, 1)$. This phenomenon
is known as the BBP transition, after the names of the authors. A very general class
of random matrices exhibits similar behavior, under finite rank perturbations, as
shown by Tao (2011). These results are extended to more general distributions
in Benaych-Georges et al. (2011).

Assume that

$$W_n = \frac{1}{n} \sum_{i=1}^n X_i X_i^\top \quad \text{and} \quad \widetilde{W}_n = \theta vv^\top + W_n .$$

Qualitatively, the BBP phase transition predicts that there exists a critical
value $\theta^*$ such that if $\theta > \theta^*$, the spectrum of $\widetilde{W}_n$ exhibits an isolated eigenvalue
significantly larger than the others, and that if $\theta < \theta^*$, the spectrum has a very
similar behavior under the two hypotheses.

Even when $p/n \to \alpha \in (0, 1)$, Benaych-Georges et al. (2011) show that if $\theta <
\alpha + \sqrt{\alpha}$, the leading eigenvalue will have limit $(1 + \sqrt{\alpha})^2$, i.e., the same as under the
null hypothesis. Similarly, the random fluctuations around this limit will follow
the Tracy-Widom distribution, for both hypotheses.

The above analysis indicates that detection using the largest eigenvalue is 
impossible already for moderate dimension, without further assumptions. Nev- 
ertheless, resorting to the sparsity assumption allows us to bypass this intrinsic 
limitation of the largest eigenvalue as a test statistic. 

3.3 Sparse eigenvalues 

To exploit the sparsity assumption, we use the fact that only a small submatrix
of the empirical covariance will be affected by the perturbation. Let $A$ be a $p \times p$
matrix and fix $k \le p$. We define the $k$-sparse largest$^1$ eigenvalue by

$$\lambda^k_{\max}(A) = \max_{|S| = k} \lambda_{\max}(A_S) . \tag{3.1}$$

We have the same equalities as for regular eigenvalues:

$$\lambda^k_{\max}(I_p) = 1 \quad \text{and} \quad \lambda^k_{\max}(I_p + \theta vv^\top) = 1 + \theta .$$

However, the $k$-sparse eigenvalue behaves differently under the two hypotheses
as soon as there exists a $k \times k$ matrix with a significantly higher largest eigenvalue.
The BBP transition, in a similar setting, indicates that this is true as soon as $\theta >
\gamma + \sqrt{\gamma}$ when $k/n \to \gamma > 0$. Therefore, it appears that the sparsity assumption
can be exploited to significantly reduce the dimensionality of the problem.
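
For small $p$ and $k$, the definition (3.1) can be evaluated directly by enumerating all supports of size $k$. The sketch below is a brute-force illustration only; the function name `sparse_eig_bruteforce` and the test matrices are assumptions, and the cost grows like $\binom{p}{k}$, so it is not a practical algorithm.

```python
from itertools import combinations

import numpy as np

def sparse_eig_bruteforce(A, k):
    """k-sparse largest eigenvalue via (3.1): max over |S| = k of lambda_max(A_S)."""
    p = A.shape[0]
    best = -np.inf
    for S in combinations(range(p), k):
        idx = np.array(S)
        sub = A[np.ix_(idx, idx)]            # principal submatrix A_S
        best = max(best, np.linalg.eigvalsh(sub)[-1])
    return best

p, k, theta = 8, 3, 0.7
v = np.zeros(p)
v[[1, 4, 6]] = 1 / np.sqrt(3)                # a 3-sparse unit vector

lam_identity = sparse_eig_bruteforce(np.eye(p), k)
lam_spiked = sparse_eig_bruteforce(np.eye(p) + theta * np.outer(v, v), k)
print(lam_identity, lam_spiked)
```

On these two population matrices the routine returns the values stated in the equalities above, namely 1 for the identity and $1 + \theta$ for the spiked matrix.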

4. SPARSE PRINCIPAL COMPONENT DETECTION 

In sparse principal component detection, we are testing the existence of a
sparse direction $v$ with a significantly higher explained variance $v^\top \Sigma v$ than any
other direction. Following the motivation of the previous section, we study the
properties of the test statistic $\varphi(\hat\Sigma) = \lambda^k_{\max}(\hat\Sigma)$, where we recall that $\lambda^k_{\max}(\hat\Sigma)$
is the $k$-sparse eigenvalue of $\hat\Sigma$ and can be defined equivalently to (3.1), for any
$A \in \mathbf{S}_p^+$ by

$$\lambda^k_{\max}(A) = \max_{|x|_2 = 1,\ |x|_0 \le k} x^\top A x . \tag{4.2}$$

$^1$ In the rest of the paper, we drop the qualification "largest" since we only refer to this one.
4.1 Deviation bounds for the $k$-sparse eigenvalue

Finding optimal detection levels amounts to finding the right order of magnitude
of the deviations of the test statistic $\lambda^k_{\max}(\hat\Sigma)$ both under the null and the
alternative hypotheses. We begin by the following proposition, which guarantees
that our test statistic remains large enough under the alternative hypothesis.

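Proposition 4.1. Under $H_1$, we have with probability $1 - \delta$,

$$\lambda^k_{\max}(\hat\Sigma) \ge 1 + \theta - 2(1 + \theta)\sqrt{\frac{\log(1/\delta)}{n}} .$$

Proof. Under $H_1$, there exists a unit vector $v$ with sparsity $k$, such that
$X \sim \mathcal{N}(0, I_p + \theta vv^\top)$. Therefore, we have

$$\lambda^k_{\max}(\hat\Sigma) \ge v^\top \hat\Sigma v = \frac{1}{n} \sum_{i=1}^n (X_i^\top v)^2$$

by definition of $\hat\Sigma$. Since $X \sim \mathcal{N}(0, I_p + \theta vv^\top)$, we have $X^\top v \sim \mathcal{N}(0, 1 + \theta)$.
Define the random variable

$$Y = \frac{1}{n} \sum_{i=1}^n \left( \frac{(X_i^\top v)^2}{1 + \theta} - 1 \right) .$$

Using Laurent and Massart (2000, Lemma 1) on concentration of the $\chi^2$ distribution,
we get for any $t > 0$, that

$$P\left( Y \le -2\sqrt{t/n} \right) \le e^{-t} .$$

Hence, taking $t = \log(1/\delta)$, we have $Y \ge -2\sqrt{\log(1/\delta)/n}$ with probability $1 - \delta$.
Since $v^\top \hat\Sigma v = (1 + \theta)(1 + Y)$, this yields the desired inequality. ■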

Note that our proof relies only on the existence of a sparse vector $v$ associated
to the eigenvalue $(1 + \theta)$ of the population covariance matrix $\Sigma$. In particular,
the result of Proposition 4.1 extends to more general alternative hypotheses
as long as they satisfy this condition. Moreover, observe that the above lower
bound is independent of $p$ and $k$. Proposition 4.1 suggests that the spiked
covariance model is well separated from the spherical model where $\theta = 0$. Note
that much more than detection can actually be achieved under this model.
Indeed, Amini and Wainwright (2009) prove optimal rates of support recovery for
$v$ in the case where $\theta$ is known and $v$ takes only values in $\{0, \pm 1/\sqrt{k}\}$.

We now study the behavior of the $k$-sparse eigenvalue under the null hypothesis,
i.e., for a Wishart matrix with mean $I_p$. We adapt a technique from Vershynin
(2010) to obtain the desired deviation bounds.

Proposition 4.2. Under $H_0$, with probability $1 - \delta$

$$\lambda^k_{\max}(\hat\Sigma) \le 1 + 4\sqrt{\frac{k \log(9ep/k) + \log(1/\delta)}{n}} + 4\,\frac{k \log(9ep/k) + \log(1/\delta)}{n} .$$
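Proof. Using a 1/4-net over the unit sphere of $\mathbb{R}^k$, it can be easily shown (see,
e.g., Vershynin, 2010) that there exists a subset $\mathcal{N}_k$ of the unit sphere of $\mathbb{R}^k$, with
cardinality smaller than $9^k$, such that for any $A \in \mathbf{S}_k^+$

$$\lambda_{\max}(A) \le 2 \max_{x \in \mathcal{N}_k} x^\top A x . \tag{4.3}$$

Under $H_0$, we have

$$\lambda^k_{\max}(\hat\Sigma) \le 1 + \max_S \left\{ \lambda_{\max}(\hat\Sigma_S) - 1 \right\} ,$$

where the maximum in the right-hand side is taken over all subsets of $\{1, \ldots, p\}$
that have cardinality $k$.

Moreover, for all $u \in \mathbb{R}^k$, $|u|_2 = 1$ and $S \subset \{1, \ldots, p\}$ such that $|S| = k$, let
$\tilde{u} \in \mathbb{R}^p$ be the vector with support $S$ such that $\tilde{u}_S = u$. We have

$$u^\top \hat\Sigma_S u - 1 = \tilde{u}^\top \hat\Sigma \tilde{u} - 1 = \frac{1}{n} \sum_{i=1}^n \left[ (\tilde{u}^\top X_i)^2 - 1 \right] .$$

Since $|\tilde{u}|_2 = |u|_2 = 1$, Laurent and Massart (2000, Lemma 1) yields for any $t > 0$,

$$P\left( \frac{1}{n} \sum_{i=1}^n \left[ (\tilde{u}^\top X_i)^2 - 1 \right] \ge 2\sqrt{\frac{t}{n}} + 2\frac{t}{n} \right) \le e^{-t} . \tag{4.4}$$

For any $S \subset \{1, \ldots, p\}$, define $\mathbb{R}^S$ to be the subset of $\mathbb{R}^p$ defined such that
$x \in \mathbb{R}^S$ iff $x_j = 0$, $\forall j \notin S$. Let $\mathcal{N}_k(S)$ be a subset of the unit sphere of $\mathbb{R}^S$, with
cardinality smaller than $9^k$ such that for any $A \in \mathbf{S}_k^+$, inequality (4.3) holds with
$\mathcal{N}_k = \mathcal{N}_k(S)$. Fix $t > 0$ and define the event $\mathcal{A}_S$ by

$$\mathcal{A}_S = \left\{ \lambda_{\max}(\hat\Sigma_S) - 1 \ge 4\sqrt{\frac{t}{n}} + 4\frac{t}{n} \right\} .$$

Observe that a union bound over the elements of $\mathcal{N}_k(S)$ together with (4.4) yields
that for any $t > 0$,

$$P(\mathcal{A}_S) \le P\left( \max_{x \in \mathcal{N}_k(S)} \frac{1}{n} \sum_{i=1}^n (x^\top X_i)^2 - 1 \ge 2\sqrt{\frac{t}{n}} + 2\frac{t}{n} \right) \le 9^k e^{-t} .$$

Let now $\mathcal{A}$ be the event defined by

$$\mathcal{A} = \bigcup_{|S| = k} \mathcal{A}_S = \left\{ \max_{|S| = k} \left\{ \lambda_{\max}(\hat\Sigma_S) - 1 \right\} \ge 4\sqrt{\frac{t}{n}} + 4\frac{t}{n} \right\} .$$

Therefore, by a union bound on the $\binom{p}{k}$ subsets $S$ of $\{1, \ldots, p\}$ that have
cardinality $k$, we get

$$P\left( \lambda^k_{\max}(\hat\Sigma) \ge 1 + 4\sqrt{\frac{t}{n}} + 4\frac{t}{n} \right) \le P(\mathcal{A}) \le \binom{p}{k} 9^k e^{-t} .$$

To conclude our proof, it is sufficient to use the standard inequality $\binom{p}{k} \le \left(\frac{ep}{k}\right)^k$
and to take $t = k \log(9ep/k) + \log(1/\delta)$. ■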
4.2 Hypothesis testing with $\lambda^k_{\max}$

Using these results, we have, with the notations from Section 2,

$$P_{H_0}\left( \lambda^k_{\max}(\hat\Sigma) > \tau_0 \right) \le \delta$$
$$P_{H_1}\left( \lambda^k_{\max}(\hat\Sigma) < \tau_1 \right) \le \delta ,$$

where $\tau_0$ and $\tau_1$ are given by

$$\tau_0 = 1 + 4\sqrt{\frac{k \log(9ep/k) + \log(1/\delta)}{n}} + 4\,\frac{k \log(9ep/k) + \log(1/\delta)}{n}$$

$$\tau_1 = 1 + \theta - 2(1 + \theta)\sqrt{\frac{\log(1/\delta)}{n}} .$$

Whenever $\tau_1 > \tau_0$, we take $\tau \in [\tau_0, \tau_1]$ and define the following test

$$\psi(\hat\Sigma) = \mathbf{1}\{\lambda^k_{\max}(\hat\Sigma) > \tau\} .$$

It follows from the previous subsection that it discriminates between $H_1$ and $H_0$
with probability $1 - \delta$.

It remains to find for which values of $\theta$ the condition $\tau_1 > \tau_0$ holds. It corresponds
to our minimum detection level.

Theorem 4.1. Assume that $k, p, n$ and $\delta$ are such that $\bar\theta < 1$, where

$$\bar\theta := 4\sqrt{\frac{k \log(9ep/k) + \log(1/\delta)}{n}} + 4\,\frac{k \log(9ep/k) + \log(1/\delta)}{n} + 4\sqrt{\frac{\log(1/\delta)}{n}} . \tag{4.5}$$

Then, for any $\theta > \bar\theta$ and for any $\tau \in [\tau_0, \tau_1]$, the test $\psi(\hat\Sigma) = \mathbf{1}\{\lambda^k_{\max}(\hat\Sigma) > \tau\}$
discriminates between $H_0$ and $H_1$ with probability $1 - \delta$.

If we consider asymptotic regimes, for large $p, n, k$, taking $\delta = p^{-\beta}$ with $\beta >
0$, provides a sequence of tests $\psi_n$ that discriminate between $H_0$ and $H_1$ with
probability converging to 1, for any fixed $\theta > 0$, as soon as

$$\frac{k \log(p/k)}{n} \to 0 .$$
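
As a numerical illustration of how the levels interact, the sketch below evaluates $\tau_0$, $\tau_1$ and the detection level for one set of parameters. The function name `detection_levels` and the numerical values of `n`, `p`, `k`, `delta` are assumptions chosen for illustration, and the third term of the detection level is taken to be $4\sqrt{\log(1/\delta)/n}$, which is the value that makes $\tau_1 \ge \tau_0$ follow from the two propositions of Section 4.1.

```python
import math

def detection_levels(n, p, k, delta, theta):
    """Return (tau0, tau1, theta_bar) for the k-sparse eigenvalue test."""
    a = k * math.log(9 * math.e * p / k) + math.log(1 / delta)
    s = math.sqrt(math.log(1 / delta) / n)
    tau0 = 1 + 4 * math.sqrt(a / n) + 4 * a / n          # upper level under the null
    tau1 = 1 + theta - 2 * (1 + theta) * s               # lower level under the alternative
    theta_bar = 4 * math.sqrt(a / n) + 4 * a / n + 4 * s  # assumed form of the detection level
    return tau0, tau1, theta_bar

n, p, k, delta = 20000, 50, 2, 0.05
_, _, theta_bar = detection_levels(n, p, k, delta, theta=0.0)
tau0, tau1, _ = detection_levels(n, p, k, delta, theta=1.01 * theta_bar)

print(f"theta_bar={theta_bar:.4f}  tau0={tau0:.4f}  tau1={tau1:.4f}")
```

With a signal slightly above the detection level, the interval $[\tau_0, \tau_1]$ is non-empty, so any threshold $\tau$ in it separates the two hypotheses at the stated confidence.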

5. MINIMAX LOWER BOUNDS FOR DETECTION 

The goal of this section is to prove that if $\theta < C\bar\theta$ for some $C > 0$, where $\bar\theta$ is
defined in (4.5), then no test can discriminate between $H_0$ and $H_1$ with arbitrarily
small probability of error. We will see that this result can be achieved up to logarithmic
terms that vanish for interesting regimes of $p$, $n$ and $k$. Throughout this section,
assume that $\theta < 1/\sqrt{2}$.
In order to find lower bounds for the probability of error, we study the $\chi^2$
distance between probability measures (see, e.g., Tsybakov, 2009, chapter 2). For
any $v \in \mathbb{R}^p$ such that $|v|_2 = 1$, define the matrix $\Sigma_v = I_p + \theta vv^\top$ and let $P_v$
denote the distribution of a Gaussian random variable $X \sim \mathcal{N}(0, \Sigma_v)$. Moreover,
let $\mathcal{S} = \{S \subset \{1, \ldots, p\} : |S| = k\}$ and for any $S \in \mathcal{S}$, define $u(S) \in \mathbb{R}^p$ to be the
unit vector with $j$th coordinate equal to $1/\sqrt{k}$ if $j \in S$ and 0 otherwise. Finally,
define the Gaussian mixture $P_{\mathcal{S}}$ by

$$P_{\mathcal{S}} = \frac{1}{|\mathcal{S}|} \sum_{S \in \mathcal{S}} P_{u(S)} .$$

The following theorem holds.

Theorem 5.1. Fix $\nu > 0$ and define $\underline\theta > 0$ by

$$\underline\theta := \sqrt{\frac{k \log(\nu p/k^2 + 1)}{n}} . \tag{5.1}$$

Then there exists a constant $C_\nu > 0$ such that for any $\theta < \underline\theta \wedge 1/\sqrt{2}$, it holds

$$\inf_{\psi} \left\{ P_0^n(\psi = 1) \vee \max_{|v|_2 = 1,\ |v|_0 \le k} P_v^n(\psi = 0) \right\} \ge C_\nu ,$$

where the infimum is taken over all possible tests, i.e., measurable functions of
the observations that take values in $\{0, 1\}$. Moreover, by taking $\nu$ small enough,
$C_\nu$ can be made arbitrarily close to 1/2.

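We write for simplicity $P_S = P_{u(S)}$ when this leads to no confusion. Our proof
relies on the following lemma.

Lemma 5.1. For any $S, T \in \mathcal{S}$ and any $\theta < 1/2$, it holds

$$\mathbb{E}_{P_0}\left( \frac{dP_S}{dP_0} \frac{dP_T}{dP_0} \right) = \left( 1 - \theta^2 (u(S)^\top u(T))^2 \right)^{-1/2} .$$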

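Proof. Fix $S \in \mathcal{S}$ and observe that

$$\frac{dP_S}{dP_0}(X) = \frac{\det(I_p)^{1/2} \exp\left(-\frac{1}{2} X^\top \Sigma_{u(S)}^{-1} X\right)}{\det(\Sigma_{u(S)})^{1/2} \exp\left(-\frac{1}{2} X^\top I_p^{-1} X\right)} .$$

Furthermore, since $\det(I_p) = 1$ and $|u(S)|_2 = 1$, we get by Sylvester's determinant
theorem that

$$\det(\Sigma_{u(S)}) = \det(I_p + \theta u(S) u(S)^\top) = \det(I_1 + \theta u(S)^\top u(S)) = 1 + \theta .$$

Moreover, the Sherman-Morrison formula yields

$$\Sigma_{u(S)}^{-1} = (I_p + \theta u(S) u(S)^\top)^{-1} = I_p - \frac{\theta}{1 + \theta} u(S) u(S)^\top .$$

By substitution, the above three displays yield

$$\frac{dP_S}{dP_0}(X) = \frac{1}{\sqrt{1 + \theta}} \exp\left( \frac{1}{2} \frac{\theta}{1 + \theta} (X^\top u(S))^2 \right)$$

and

$$\frac{dP_S}{dP_0} \frac{dP_T}{dP_0}(X) = \frac{1}{1 + \theta} \exp(X^\top M X) , \tag{5.2}$$

where $M$ is defined by

$$M = \frac{1}{2} \frac{\theta}{1 + \theta} \left( u(S) u(S)^\top + u(T) u(T)^\top \right) .$$

Note that $M$ has at most two non-zero eigenvalues given by

$$\lambda_{1,2} = \frac{1}{2} \frac{\theta}{1 + \theta} \left( 1 \pm u(S)^\top u(T) \right) < \frac{1}{2} ,$$

and let $\Lambda$ denote the diagonal matrix with elements $(\lambda_1, \lambda_2, 0, \ldots, 0) \in \mathbb{R}^p$.
Together with (5.2), it yields

$$\mathbb{E}_{P_0}\left( \frac{dP_S}{dP_0} \frac{dP_T}{dP_0} \right) = \frac{1}{1 + \theta} \mathbb{E}_{P_0}[\exp(X^\top M X)]$$
$$= \frac{1}{1 + \theta} \mathbb{E}_{P_0}[\exp(X^\top \Lambda X)]$$
$$= \frac{1}{1 + \theta} \mathbb{E}_{P_0}[\exp(\lambda_1 X_1^2)]\, \mathbb{E}_{P_0}[\exp(\lambda_2 X_2^2)]$$
$$= \frac{1}{1 + \theta} \left[ (1 - 2\lambda_1)(1 - 2\lambda_2) \right]^{-1/2} ,$$

where, in the second equality, the substitution of $M$ by $\Lambda$ is valid by rotational
invariance of the distribution of $X$ under $P_0$. The last equation yields the desired
result. ■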
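
The algebraic steps of this proof can be checked numerically on a small instance. The sketch below is illustrative only: the dimensions, the value of `theta` and the two supports are arbitrary assumptions. It verifies the determinant and inverse of $I_p + \theta u(S)u(S)^\top$ and that $\frac{1}{1+\theta}[(1 - 2\lambda_1)(1 - 2\lambda_2)]^{-1/2}$ agrees with $(1 - \theta^2 (u(S)^\top u(T))^2)^{-1/2}$.

```python
import numpy as np

p, k, theta = 10, 4, 0.4             # illustrative values

def u(S):
    # Unit vector with entries 1/sqrt(k) on the support S and 0 elsewhere.
    vec = np.zeros(p)
    vec[list(S)] = 1 / np.sqrt(k)
    return vec

uS, uT = u([0, 1, 2, 3]), u([2, 3, 4, 5])
c = uS @ uT                          # equals |S ∩ T| / k

Sigma = np.eye(p) + theta * np.outer(uS, uS)
det_ok = np.isclose(np.linalg.det(Sigma), 1 + theta)                 # Sylvester
inv_ok = np.allclose(np.linalg.inv(Sigma),
                     np.eye(p) - theta / (1 + theta) * np.outer(uS, uS))  # Sherman-Morrison

M = theta / (2 * (1 + theta)) * (np.outer(uS, uS) + np.outer(uT, uT))
lam = np.linalg.eigvalsh(M)[-2:]     # the two largest eigenvalues; the others vanish
lhs = 1 / (1 + theta) / np.sqrt((1 - 2 * lam[0]) * (1 - 2 * lam[1]))
rhs = (1 - theta**2 * c**2) ** -0.5
print(det_ok, inv_ok, lhs, rhs)
```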



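We now turn to the proof of Theorem 5.1.

Proof. Observe now that

$$\chi^2(P_{\mathcal{S}}, P_0) = \mathbb{E}_{P_0}\left[ \left( \frac{dP_{\mathcal{S}}}{dP_0} \right)^2 \right] - 1 = \frac{1}{|\mathcal{S}|^2} \sum_{S, T \in \mathcal{S}} \mathbb{E}_{P_0}\left( \frac{dP_S}{dP_0} \frac{dP_T}{dP_0} \right) - 1 .$$

Lemma 5.1 together with the fact $u(S)^\top u(T) = |S \cap T|/k$ yields

$$\chi^2(P_{\mathcal{S}}, P_0) = \sum_{r=0}^{k} \frac{C(\mathcal{S}, r)}{|\mathcal{S}|^2} \left[ 1 - \theta^2 \left( \frac{r}{k} \right)^2 \right]^{-1/2} - 1 , \tag{5.3}$$

where $C(\mathcal{S}, r)$ denotes the number of pairs of subsets $S, T \in \mathcal{S}$ such that $|S \cap T| = r$.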

We now control the term on the right-hand side of (5.3). Let $S, T$ be chosen
uniformly at random in $\mathcal{S}$, and observe that $P(|S \cap T| = r) = P(R = r)$, where
$R = |S \cap \{1, \ldots, k\}|$. It yields

$$\chi^2(P^n_{\mathcal{S}}, P^n_0) = \mathbb{E}_S\left[ \left( 1 - \theta^2 \frac{|S \cap \{1, \ldots, k\}|^2}{k^2} \right)^{-n/2} \right] - 1 = \mathbb{E}_R\left[ \left( 1 - \theta^2 \frac{R^2}{k^2} \right)^{-n/2} \right] - 1 ,$$

where $\mathbb{E}_S$ denotes the expectation with respect to the random subset $S$ and $\mathbb{E}_R$
the expectation with respect to the hypergeometric random variable $R$.

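Using now the convexity inequality $(1 - t)^{-n/2} \le e^{\frac{nt}{2(1-t)}} \le e^{nt}$, valid for $1 - t \ge
1/2$, and noticing that $R \le k$, the above display leads to

$$\chi^2(P^n_{\mathcal{S}}, P^n_0) \le \mathbb{E}_R\left[ \exp\left( \frac{n\theta^2 R}{k} \right) \right] - 1 . \tag{5.4}$$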



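Define $\mu^2 = n\theta^2/k$. We have, as in Addario-Berry et al. (2010); Arias-Castro et al.
(2011), that

$$\mathbb{E}_R[\exp(\mu^2 R)] - 1 = \mathbb{E}_S\left[ \prod_{i=1}^{k} \exp(\mu^2 \mathbf{1}\{i \in S\}) \right] - 1$$
$$\le \prod_{i=1}^{k} \mathbb{E}_S\left[ \exp(\mu^2 \mathbf{1}\{i \in S\}) \right] - 1 , \quad \text{by negative association}$$
$$= \left[ \frac{k}{p}\left( e^{\mu^2} - 1 \right) + 1 \right]^k - 1 .$$

Assume now that $\theta < \underline\theta$. It yields

$$\mathbb{E}_R[\exp(\mu^2 R)] - 1 \le \left[ \frac{k}{p}\left( e^{\mu^2} - 1 \right) + 1 \right]^k - 1 \le \left( \frac{\nu}{k} + 1 \right)^k - 1 \le e^{\nu} - 1 .$$

Together with (5.4) it yields $\chi^2(P^n_{\mathcal{S}}, P^n_0) \le e^{\nu} - 1$.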
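
The chain of inequalities used here can be checked exactly for small parameters, since the hypergeometric expectation is a finite sum. The sketch below is an illustration under assumed values of `p`, `k`, `n` and `nu`; the function names are hypothetical. It compares $\mathbb{E}_R[\exp(\mu^2 R)]$ with the product bound obtained by negative association, and then with $e^{\nu}$ when $\theta$ equals the level defined in (5.1).

```python
import math

def hypergeom_mgf(p, k, mu2):
    """Exact E[exp(mu2 * R)], R = |S ∩ {1..k}| for S a uniform k-subset of {1..p}."""
    total = math.comb(p, k)
    return sum(
        math.comb(k, r) * math.comb(p - k, k - r) / total * math.exp(mu2 * r)
        for r in range(k + 1)
    )

def product_bound(p, k, mu2):
    # Bound obtained by treating the indicators 1{i in S} as if independent.
    return ((k / p) * (math.exp(mu2) - 1) + 1) ** k

p, k, n, nu = 400, 5, 100, 0.5                       # illustrative values
theta_low = math.sqrt(k * math.log(nu * p / k**2 + 1) / n)   # level (5.1)
mu2 = n * theta_low**2 / k

exact = hypergeom_mgf(p, k, mu2)
bound = product_bound(p, k, mu2)
print(f"exact={exact:.4f}  product bound={bound:.4f}  e^nu={math.exp(nu):.4f}")
```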

We are now in a position to apply standard results from minimax theory. Note
that for all measurable tests $\psi$, we have

$$P_0^n(\psi = 1) \vee \max_{|v|_2 = 1,\ |v|_0 \le k} P_v^n(\psi = 0) \ge P_0^n(\psi = 1) \vee \max_{S \in \mathcal{S}} P^n_{u(S)}(\psi = 0)$$
$$\ge P_0^n(\psi = 1) \vee P^n_{\mathcal{S}}(\psi = 0)$$
$$\ge \frac{e^{-(e^{\nu} - 1)}}{4} \vee \frac{1 - \sqrt{(e^{\nu} - 1)/2}}{2} ,$$

where the last inequality is a direct consequence of Tsybakov (2009, Theorem 2.2,
case (iii)). Observe that $C_\nu \to 1/2$ if $\nu \to 0$. ■



We observe a gap between our upper and lower bounds, with a term in $\log(p/k)$
in the upper bound, and one in $\log(p/k^2)$ in the lower bound. This gap has been
observed in the detection literature before (see, e.g., Baraud, 2002; Verzelen, 2012,
for an explicit remark) and, to our knowledge, has never been addressed. However,
by considering certain regimes for $p$, $n$ and $k$, it disappears. Indeed, as soon as
$p \ge k^{2+\varepsilon}$, for some $\varepsilon > 0$, upper and lower bounds match up to constants, and
the detection rate for the sparse eigenvalue is optimal in a minimax sense. Under
this assumption, detection becomes impossible if

$$\theta < C\sqrt{\frac{k \log(p/k)}{n}}$$

for a small enough constant $C > 0$.

imsart-sts ver. 2009/08/13 file: BerRigl2_arxiv.tex date: February 24, 2012

OPTIMAL DETECTION OF SPARSE PRINCIPAL COMPONENTS


6. SEMIDEFINITE METHODS FOR SPARSE PRINCIPAL COMPONENT TESTING

Computing the largest $k$-sparse eigenvalue $\lambda^k_{\max}$ of a symmetric matrix $A$ is,
in general, a hard computational problem. To see this, consider the particular
case where $A$ is a $p \times p$ symmetric matrix with values in $\{0, 1\}$ and $A_{ii} = 1$
for all diagonal entries, so that $A$ corresponds to the adjacency matrix of an
undirected graph. It is not hard to see that $\lambda^k_{\max}(A) \le k$, with equality if and
only if the graph of $A$ contains a clique of size $k$. Yet, it is a well known fact
of computational complexity (see, e.g., Sipser, 1996) that the decision problem
associated to finding whether a graph contains a clique of size $k$ is NP-complete.
Note that if $k$ were fixed, this problem would actually be polynomial in the size
$p$ of the graph since there are "only" $\binom{p}{k} \le p^k$ subgraphs to enumerate. However,
the exponential dependence in $k$ is clearly concerning even for moderate values
of $k$.
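
The link with cliques can be seen on a toy graph. In the sketch below, which is an illustration only, the graph (a triangle with one pendant vertex) and the brute-force routine `sparse_eig_bruteforce` are assumptions; the routine enumerates all supports of size $k$ and is therefore exponential in $k$, as discussed above.

```python
from itertools import combinations

import numpy as np

def sparse_eig_bruteforce(A, k):
    """Largest eigenvalue over all k x k principal submatrices of A."""
    best = -np.inf
    for S in combinations(range(A.shape[0]), k):
        idx = np.array(S)
        best = max(best, np.linalg.eigvalsh(A[np.ix_(idx, idx)])[-1])
    return best

# Triangle {0, 1, 2} plus a pendant vertex 3 attached to vertex 2.
edges = [(0, 1), (1, 2), (0, 2), (2, 3)]
A = np.eye(4)                        # ones on the diagonal, as in the text
for i, j in edges:
    A[i, j] = A[j, i] = 1.0

lam3 = sparse_eig_bruteforce(A, 3)   # the graph contains a clique of size 3
lam4 = sparse_eig_bruteforce(A, 4)   # the graph contains no clique of size 4
print(lam3, lam4)
```

The value for $k = 3$ reaches the upper bound $k$ because the triangle yields an all-ones $3 \times 3$ submatrix, while for $k = 4$ the bound is not attained.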

6.1 Semidefinite relaxation for $\lambda^k_{\max}$

Semidefinite programming (SDP) is the matrix equivalent of linear programming.
Define the Euclidean scalar product in $\mathbf{S}_d$ by $\langle A, B \rangle = \mathrm{Tr}(AB)$. A semidefinite
program can be written in the canonical form

$$\mathrm{SDP} = \max.\ \mathrm{Tr}(CX) \quad \text{subject to} \quad \mathrm{Tr}(A_i X) \le b_i ,\ \forall i \in \{1, \ldots, m\} , \quad X \succeq 0 . \tag{6.1}$$

As convex problems, they are computationally efficient and can be solved using
interior point or first order methods (see, e.g., Boyd and Vandenberghe, 2004;
Nesterov and Nemirovskii, 1987). Using SDP relaxations of problems with
non-convex constraints is a common method to find an approximate solution.
Tightness bounds are sometimes proven (see, e.g., Goemans (1995) for the MAXCUT
problem). A major breakthrough for sparse PCA was achieved by d'Aspremont et al.
(2007), who introduced an SDP relaxation for $\lambda^k_{\max}$, but tightness of this relaxation
is, to this day, unknown. Our task is not as difficult though. Indeed, we only
need to prove that the SDP objective criterion has significantly different behavior
under $H_0$ and $H_1$.

Making the change of variables $Z = xx^\top$ in (4.2) yields

$$\lambda^k_{\max}(A) = \max.\ \mathrm{Tr}(AZ) \quad \text{subject to} \quad \mathrm{Tr}(Z) = 1 ,\ |Z|_0 \le k^2 ,\ Z \succeq 0 ,\ \mathrm{rank}(Z) = 1 .$$

Note that this problem contains two sources of non-convexity: the $\ell_0$ norm
constraint and the rank constraint. We make two relaxations in order to have a
convex feasible set.

First, for a semidefinite matrix $Z$, with trace 1, and sparsity $k^2$, the
Cauchy-Schwarz inequality yields $|Z|_1 \le k$, which is substituted for the cardinality
constraint in this relaxation. Simply dropping the rank constraint leads to the
following relaxation of our original problem:

$$\mathrm{SDP}_k(A) = \max.\ \mathrm{Tr}(AZ) \quad \text{subject to} \quad \mathrm{Tr}(Z) = 1 ,\ |Z|_1 \le k ,\ Z \succeq 0 . \tag{6.2}$$

Note that this optimization problem is convex since it consists in maximizing a
linear objective over a convex set. Moreover, it is a standard exercise to show
that it can be expressed in the canonical form (6.1). As such, it can be solved
efficiently using any of the aforementioned algorithms.

As a relaxation of the original problem, for any $A \succeq 0$, it holds

$$\lambda^k_{\max}(A) \le \mathrm{SDP}_k(A) . \tag{6.3}$$

Since we have proved in Section 4 that $\lambda^k_{\max}(\hat\Sigma)$ takes large values under $H_1$, this
inequality tells us that using $\mathrm{SDP}_k(\hat\Sigma)$ as a test statistic will be to our advantage
under $H_1$. It remains to show that it stays small under $H_0$. This can be achieved
by using the dual formulation of the SDP.

Lemma 6.1. (Bach et al., 2010). For a given $A \succeq 0$, we have by duality

$$\mathrm{SDP}_k(A) = \min_{U \in \mathbf{S}_p} \left\{ \lambda_{\max}(A + U) + k |U|_\infty \right\} .$$

Together with (6.3), Lemma 6.1 implies that for any $z \ge 0$ and any matrix $U$
such that $|U|_\infty \le z$, it holds

$$\lambda^k_{\max}(A) \le \mathrm{SDP}_k(A) \le \lambda_{\max}(A + U) + kz . \tag{6.4}$$

A direct consequence of (6.4) is that the functional $\lambda^k_{\max}$ is robust to perturbations
by matrices that have small $|\cdot|_\infty$-norm. Formally, let $A \succeq 0$ be such that its largest
eigenvector has $\ell_0$ norm bounded by $k$. Then, for any matrix $N$, (6.4) yields

$$\lambda^k_{\max}(A + N) \le \lambda_{\max}((A + N) - N) + k |N|_\infty = \lambda^k_{\max}(A) + k |N|_\infty .$$
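
Inequality (6.4) gives a bound that is cheap to evaluate for any admissible $U$. The sketch below is an illustration only: it assumes the particular choice $U = \mathrm{st}_z(A) - A$, where $\mathrm{st}_z$ is entrywise soft-thresholding at level $z$, so that $|U|_\infty \le z$. The sizes, the thresholds and the function names are assumptions, and the exact $k$-sparse eigenvalue is computed by brute force for comparison.

```python
from itertools import combinations

import numpy as np

def soft_threshold(A, z):
    # Entrywise soft-thresholding: sign(a) * max(|a| - z, 0).
    return np.sign(A) * np.maximum(np.abs(A) - z, 0.0)

def dual_upper_bound(A, k, z):
    """Right-hand side of (6.4) for U = soft_threshold(A, z) - A (assumed choice)."""
    return np.linalg.eigvalsh(soft_threshold(A, z))[-1] + k * z

def sparse_eig_bruteforce(A, k):
    best = -np.inf
    for S in combinations(range(A.shape[0]), k):
        idx = np.array(S)
        best = max(best, np.linalg.eigvalsh(A[np.ix_(idx, idx)])[-1])
    return best

rng = np.random.default_rng(2)
n, p, k = 200, 8, 3                   # illustrative sizes
X = rng.standard_normal((n, p))
S_hat = X.T @ X / n                   # empirical covariance under the null

exact = sparse_eig_bruteforce(S_hat, k)
bounds = {z: dual_upper_bound(S_hat, k, z) for z in (0.0, 0.05, 0.1, 0.2)}
print(exact, bounds)
```

Each value in `bounds` requires a single eigenvalue computation, in contrast with the enumeration over supports needed for the exact statistic.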

6.2 High probability bounds for convex relaxation 

We now study the properties of $\mathrm{SDP}_k(\hat\Sigma)$ and other computationally efficient
variants as test statistics for our detection problem. Recall that $\mathrm{SDP}_k(\hat\Sigma) \ge
\lambda^k_{\max}(\hat\Sigma)$. In view of (6.3), the following proposition follows directly from
Proposition 4.1.

Proposition 6.1. Under $H_1$, we have, with probability $1 - \delta$

$$\mathrm{SDP}_k(\hat\Sigma) \ge 1 + \theta - 2(1 + \theta)\sqrt{\frac{\log(1/\delta)}{n}} .$$

Akin to Proposition 4.1, Proposition 6.1 shows that $H_1$ is the easy case. Indeed,
under $H_1$, the lower deviations of $\mathrm{SDP}_k(\hat\Sigma)$ remain small and do not depend on
$k$ or $p$. We now turn to the upper deviations under $H_0$.
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Proposition 6.2. Under H , we have, with probability 1 — 5, 



SDP fc (E) < 1 + 2 / 2 MVV*) + 2 k_M^l + JM2P/S) + 2 ^(2 P /S) 



n n V n n 



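PROOF. Let $\mathrm{st}_z(A)$ be the soft-threshold of $A$ with threshold $z$,
defined by $(\mathrm{st}_z(A))_{ij} = \mathrm{sign}(A_{ij})(|A_{ij}| - z)_+$. It
follows from (6.4), applied with $U = \mathrm{st}_z(A) - A$, which satisfies
$|U|_\infty \le z$, that

$$ \lambda^k_{\max}(A) \le \mathrm{SDP}_k(A) \le \lambda_{\max}(\mathrm{st}_z(A)) + kz. \tag{6.5} $$

Let $\Delta = \mathrm{diag}(\hat\Sigma)$ be the diagonal matrix with the same
diagonal entries as $\hat\Sigma$, and $\Psi = \hat\Sigma - \Delta$ the matrix of
its off-diagonal entries, so that $\hat\Sigma = \Delta + \Psi$. Since $\Psi$ and
$\Delta$ have disjoint supports, it follows that

$$ \mathrm{st}_z(\hat\Sigma) = \mathrm{st}_z(\Delta) + \mathrm{st}_z(\Psi). \tag{6.6} $$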

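We first control the largest off-diagonal element of $\hat\Sigma$ by bounding
$|\Psi|_\infty$ with high probability. For every $i \ne j$, we have

$$
\Psi_{ij} = \frac{1}{n}\sum_{k=1}^n X_{ki}X_{kj}
= \frac{1}{2}\left[\frac{1}{n}\sum_{k=1}^n\Big\{\tfrac{1}{2}(X_{ki} + X_{kj})^2 - 1\Big\}
- \frac{1}{n}\sum_{k=1}^n\Big\{\tfrac{1}{2}(X_{ki} - X_{kj})^2 - 1\Big\}\right].
$$

Under $H_0$, we have $X \sim \mathcal{N}(0, I_p)$, so by Laurent and Massart
(2000, Lemma 1), it holds for $t > 0$ that

$$ P\Big(|\Psi_{ij}| > 2\sqrt{\frac{t}{n}} + 2\frac{t}{n}\Big) \le 4e^{-t}. $$

Hence, by a union bound on the off-diagonal terms, we get

$$ P\Big(\max_{i<j}|\Psi_{ij}| > 2\sqrt{\frac{t}{n}} + 2\frac{t}{n}\Big) \le 2p^2e^{-t}. $$

Taking $t = \log(4p^2/\delta)$ yields, with probability $1 - \delta/2$, that
$|\Psi|_\infty \le z$, where

$$ z = 2\sqrt{\frac{\log(4p^2/\delta)}{n}} + 2\frac{\log(4p^2/\delta)}{n}. \tag{6.7} $$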

Next, we control the largest diagonal element of $\hat\Sigma$ as follows. We have
by definition of $\Delta$, for every $i$,

$$ \Delta_{ii} = \frac{1}{n}\sum_{k=1}^n X_{ki}^2. $$

Applying Laurent and Massart (2000, Lemma 1) and a union bound over the $p$
diagonal terms, we get

$$ P\Big(\max_{1\le i\le p}\Delta_{ii} > 1 + 2\sqrt{\frac{t}{n}} + 2\frac{t}{n}\Big) \le pe^{-t}. $$

Taking $t = \log(2p/\delta)$ yields, with probability $1 - \delta/2$,

$$ \max_{1\le i\le p}\Delta_{ii} \le 1 + 2\sqrt{\frac{\log(2p/\delta)}{n}} + 2\frac{\log(2p/\delta)}{n}. \tag{6.8} $$

To conclude the proof of Proposition 6.2, observe that (6.5) and (6.6) imply
that for all $z \ge 0$, we have

$$
\begin{aligned}
\lambda^k_{\max}(\hat\Sigma) \le \mathrm{SDP}_k(\hat\Sigma)
&\le \lambda_{\max}(\mathrm{st}_z(\hat\Sigma)) + kz \\
&\le \lambda_{\max}(\mathrm{st}_z(\Delta)) + \lambda_{\max}(\mathrm{st}_z(\Psi)) + kz,
\end{aligned}
$$

where the last inequality follows from (6.6) and the triangle inequality for the
operator norm.

Note now that if we take $z$ as in (6.7), then $\mathrm{st}_z(\Psi) = 0$ with
probability $1 - \delta/2$. Furthermore, since $\Delta$ is a nonnegative diagonal
matrix,

$$ \lambda_{\max}(\mathrm{st}_z(\Delta)) \le \lambda_{\max}(\Delta) = \max_{1\le i\le p}\Delta_{ii}. \tag{6.9} $$

Using that (6.8) holds with probability $1 - \delta/2$, a simple union bound
ensures that the desired inequality is valid with probability $1 - \delta$. ■



6.3 Hypothesis testing with convex methods 

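Using the notation from Section 2, the results of the previous subsection can
be written as

$$
P_{H_0}\big(\mathrm{SDP}_k(\hat\Sigma) > \tilde\tau_0\big) \le \delta, \qquad
P_{H_1}\big(\mathrm{SDP}_k(\hat\Sigma) < \tilde\tau_1\big) \le \delta,
$$

where $\tilde\tau_0$ and $\tilde\tau_1$ are given by

$$
\begin{aligned}
\tilde\tau_0 &= 1 + 2\sqrt{\frac{k^2\log(4p^2/\delta)}{n}} + 2\frac{k\log(4p^2/\delta)}{n} + 2\sqrt{\frac{\log(2p/\delta)}{n}} + 2\frac{\log(2p/\delta)}{n}, \\
\tilde\tau_1 &= 1 + \theta - 2(1 + \theta)\sqrt{\frac{\log(1/\delta)}{n}}.
\end{aligned}
$$

Whenever $\tilde\tau_1 > \tilde\tau_0$, we take $\tau \in [\tilde\tau_0, \tilde\tau_1]$
and define the following computationally efficient test

$$ \psi(\hat\Sigma) = \mathbf{1}\{\mathrm{SDP}_k(\hat\Sigma) > \tau\}. $$

It discriminates between $H_0$ and $H_1$ with probability $1 - \delta$.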
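To get a feel for when the two levels separate, the thresholds can be evaluated numerically. The following is a minimal sketch using only the standard library; it assumes the two levels have the form $1 + 2k\sqrt{a} + 2ka + 2\sqrt{b} + 2b$ and $1 + \theta - 2(1+\theta)\sqrt{c}$ with $a = \log(4p^2/\delta)/n$, $b = \log(2p/\delta)/n$ and $c = \log(1/\delta)/n$, as obtained from Propositions 6.1 and 6.2. The function name `sdp_thresholds` and the parameter values are illustrative choices.

```python
import math

def sdp_thresholds(n, p, k, delta, theta):
    """Return (tau0, tau1): the upper level under H0 (Proposition 6.2)
    and the lower level under H1 (Proposition 6.1)."""
    a = math.log(4 * p**2 / delta) / n
    b = math.log(2 * p / delta) / n
    c = math.log(1 / delta) / n
    tau0 = 1 + 2 * k * math.sqrt(a) + 2 * k * a + 2 * math.sqrt(b) + 2 * b
    tau1 = 1 + theta - 2 * (1 + theta) * math.sqrt(c)
    return tau0, tau1

# Large sample size: the two levels are well separated.
tau0, tau1 = sdp_thresholds(n=10**6, p=100, k=5, delta=0.05, theta=1.0)
print(tau0 < tau1)

# Small sample size: the interval [tau0, tau1] is empty.
tau0_small, tau1_small = sdp_thresholds(n=100, p=100, k=5, delta=0.05, theta=1.0)
print(tau0_small < tau1_small)
```

The dominant term of the null level is $2k\sqrt{\log(4p^2/\delta)/n}$, so the sample size needed for separation grows roughly like $k^2\log p$.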

It remains to find for which values of $\theta$ the condition
$\tilde\tau_1 > \tilde\tau_0$ holds. It corresponds to our minimum detection level.

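Theorem 6.1. Assume that $p, n, k$ and $\delta$ are such that $\tilde\theta \le 1$,
where

$$
\tilde\theta := 2\sqrt{\frac{k^2\log(4p^2/\delta)}{n}} + 2\frac{k\log(4p^2/\delta)}{n} + 2\sqrt{\frac{\log(2p/\delta)}{n}} + 2\frac{\log(2p/\delta)}{n} + 4\sqrt{\frac{\log(1/\delta)}{n}}.
\tag{6.10}
$$

Then, for any $\theta > \tilde\theta$ and any $\tau \in [\tilde\tau_0, \tilde\tau_1]$,
the test $\psi(\hat\Sigma) = \mathbf{1}\{\mathrm{SDP}_k(\hat\Sigma) > \tau\}$
discriminates between $H_0$ and $H_1$ with probability $1 - \delta$.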
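The role of the condition $\tilde\theta \le 1$ can be seen from a short computation. The derivation below is a sketch, assuming the levels of Section 6.3 in the form $\tilde\tau_0 = 1 + \tilde\theta - 4s$ and $\tilde\tau_1 = 1 + \theta - 2(1+\theta)s$, where the shorthand $s = \sqrt{\log(1/\delta)/n}$ is introduced here for convenience.

```latex
\begin{align*}
\tilde\tau_1 - \tilde\tau_0
  &= \theta - 2(1+\theta)s - \tilde\theta + 4s
   = \theta(1 - 2s) + 2s - \tilde\theta . \\
\intertext{Since $4s \le \tilde\theta \le 1$, we have $1 - 2s > 0$, so the right-hand
side is increasing in $\theta$. For $\theta > \tilde\theta$ this gives}
\tilde\tau_1 - \tilde\tau_0
  &> \tilde\theta(1 - 2s) + 2s - \tilde\theta
   = 2s\,(1 - \tilde\theta) \;\ge\; 0 .
\end{align*}
```

Hence the interval $[\tilde\tau_0, \tilde\tau_1]$ is nonempty as soon as $\theta > \tilde\theta$, and any threshold in it controls both types of error at level $\delta$.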
If we consider asymptotic regimes, for large $p, n, k$, taking $\delta = p^{-\beta}$
with $\beta > 0$ provides a sequence of tests $\psi_n$ that discriminate between
$H_0$ and $H_1$ with probability converging to 1, for any fixed $\theta > 0$, as
soon as

$$ \frac{k^2\log(p)}{n} \to 0. $$

Note that, compared to Theorem 4.1, the price to pay for using this convex
relaxation is to multiply the minimum detection level by a factor $\sqrt{k}$. Of
course, in most examples, $k$ remains small so that this is not a very high price.

6.4 Simple methods 

While the SDP relaxation proposed in the previous subsection is provably
computationally efficient, it is also known to scale poorly on very large problems.
Simple heuristics such as the diagonal method of Johnstone (2001) become more
attractive for larger problems. A careful inspection of the proofs in the previous
subsection is very informative. It indicates that our results hold not only for
the test $\psi(\hat\Sigma)$ but also for a simpler statistic arising from the dual
formulation (6.5). Indeed, to control the behavior of $\mathrm{SDP}_k(\hat\Sigma)$
under $H_0$, we showed that it was no larger than the minimum dual perturbation
$\mathrm{MDP}_k(\hat\Sigma)$ defined by

$$ \mathrm{MDP}_k(\hat\Sigma) = \min_{z \ge 0}\{\lambda_{\max}(\mathrm{st}_z(\hat\Sigma)) + kz\}. \tag{6.11} $$

Clearly $\mathrm{MDP}_k(\hat\Sigma) \ge \mathrm{SDP}_k(\hat\Sigma) \ge \lambda^k_{\max}(\hat\Sigma)$,
so that both Proposition 6.1 and Proposition 6.2 still hold with
$\mathrm{SDP}_k(\hat\Sigma)$ replaced by $\mathrm{MDP}_k(\hat\Sigma)$. As a result,
for any $\theta > \tilde\theta$, the test $\mathbf{1}\{\mathrm{MDP}_k(\hat\Sigma) > \tau\}$
discriminates between $H_0$ and $H_1$ with probability $1 - \delta$.
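The MDP statistic only requires soft-thresholding and a largest-eigenvalue computation, which makes it easy to prototype. The sketch below is one possible implementation, assuming NumPy; it approximates the minimum over $z \ge 0$ by a search over a uniform grid on $[0, \max_{ij}|A_{ij}|]$ (beyond that point the thresholded matrix is zero and the objective $kz$ only increases). The grid search and the names `soft_threshold` and `mdp_statistic` are choices made for this example, not the authors' procedure.

```python
import numpy as np

def soft_threshold(A, z):
    """Entrywise soft-thresholding: sign(A_ij) * max(|A_ij| - z, 0)."""
    return np.sign(A) * np.maximum(np.abs(A) - z, 0.0)

def mdp_statistic(A, k, num_grid=200):
    """Grid approximation of min over z >= 0 of lambda_max(st_z(A)) + k z.

    The grid minimum can only overestimate the exact minimum, so the
    returned value is still an upper bound on SDP_k(A) by (6.5).
    """
    A = np.asarray(A, dtype=float)
    z_max = float(np.abs(A).max())
    grid = np.linspace(0.0, z_max, num_grid)
    values = [np.linalg.eigvalsh(soft_threshold(A, z))[-1] + k * z for z in grid]
    return float(min(values))

# Sanity check on the identity: the objective is (1 - z) + k z on [0, 1],
# which is minimized at z = 0 whenever k >= 1.
print(mdp_statistic(np.eye(3), k=2))
```

Each evaluation costs one symmetric eigenvalue decomposition, so the total cost is `num_grid` decompositions of a $p \times p$ matrix, with no semidefinite program to solve.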

Actually, a detection level of the same order as $\tilde\theta$ holds already for
an even simpler test statistic: the largest diagonal element of $\hat\Sigma$. This
method, called Johnstone's diagonal method, was first proposed by Johnstone (2001)
and later studied by Amini and Wainwright (2009). For the problem of detection
considered here, it dictates to employ the test statistic

$$ D(\hat\Sigma) = \max_{1\le i\le p}\hat\Sigma_{ii}. $$

Using even simpler techniques than in Propositions 6.1 and 6.2, it is not hard to
show that

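$$
P_{H_0}\big(D(\hat\Sigma) > \tau_0^D\big) \le \delta, \qquad
P_{H_1}\big(D(\hat\Sigma) < \tau_1^D\big) \le \delta,
$$

for levels $\tau_0^D$ and $\tau_1^D$ given by

$$
\begin{aligned}
\tau_0^D &= 1 + 2\sqrt{\frac{\log(p/\delta)}{n}} + 2\frac{\log(p/\delta)}{n}, \\
\tau_1^D &= 1 + \frac{\theta}{k} - 2\Big(1 + \frac{\theta}{k}\Big)\sqrt{\frac{\log(1/\delta)}{n}}.
\end{aligned}
$$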

However, as we shall see in Section 8, $\mathrm{MDP}_k$ behaves much better than
$D$ in practice. It was proved by Amini and Wainwright (2009) that if the SDP (6.2)
has a solution of rank one then it is strictly better than Johnstone's diagonal
method. While they study a support recovery problem different from the detection
problem considered here, it seems to indicate that the two methods are
qualitatively different. However, the assumption that the SDP (6.2) has a solution
of rank one is very strong and unnecessary in our problem. Indeed, while rank one
solutions are amenable to extracting sparse eigenvectors, we only need the value
of the objective at its maximum and not the solution itself. In this sense, we
avoid the main limitation of SDP relaxations to vector problems.

7. GENERALIZATION WITH WEAKENED ASSUMPTIONS 

In this section we investigate two extensions of our original problem. For
simplicity, we denote by $\mathrm{*DP}_k$ any of the two functionals
$\mathrm{MDP}_k$ or $\mathrm{SDP}_k$.

7.1 Adversarial noise

While the results for $\lambda^k_{\max}$ rely heavily on the fact that the $X_i$
are Gaussian random vectors, it is not the case for those on the convex
relaxation. We find that under much weaker assumptions, the results for detection
using the SDP statistic are still valid. We also describe an adversarial noise
setting where we prove that these detection levels are optimal.

In this setting, for an original covariance matrix $\Sigma$, assume that

$$ \hat\Sigma = \Sigma + N, \tag{7.1} $$

where the only assumption on $N$ is that $|N|_\infty \le \sqrt{\log(p/\delta)/n}$
with probability $1 - \delta$. Up to a constant, this is a generalization of our
initial setting, and can describe a situation where the data is censored, akin to
the setting of Loh and Wainwright (2011). Here, however, the situation is more
general, as the censored entries are not necessarily chosen randomly.

We show below that the high probability bounds for $\lambda^k_{\max}(\hat\Sigma)$,
$\mathrm{SDP}_k(\hat\Sigma)$ and $\mathrm{MDP}_k(\hat\Sigma)$ under $H_0$ and $H_1$
that were constructed before depend only on this very mild assumption.

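Proposition 7.1. Under $H_1$, we have with probability $1 - \delta$

$$ \mathrm{*DP}_k(\hat\Sigma) \ge \lambda^k_{\max}(\hat\Sigma) \ge 1 + \theta - k\sqrt{\frac{\log(p/\delta)}{n}}. $$

Proof. Recall that for any $v$ such that $|v|_0 \le k$, we have

$$
\begin{aligned}
\mathrm{*DP}_k(\hat\Sigma) \ge \lambda^k_{\max}(\hat\Sigma) \ge v^\top\hat\Sigma v
&\ge v^\top(I_p + \theta vv^\top)v + v^\top Nv \\
&\ge 1 + \theta - |N|_\infty|v|_1^2 \\
&\ge 1 + \theta - k|N|_\infty,
\end{aligned}
$$

which yields the desired result.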
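The last two inequalities of this proof rest on an elementary bound on the quadratic form $v^\top N v$. The following worked step is a sketch, assuming that $v$ is the unit vector of the alternative, so that $|v|_2 = 1$ and $|v|_0 \le k$.

```latex
\begin{align*}
|v^\top N v|
  = \Big|\sum_{i,j} v_i N_{ij} v_j\Big|
  &\le |N|_\infty \sum_{i,j} |v_i|\,|v_j|
   = |N|_\infty\, |v|_1^2 , \\
|v|_1^2
  = \Big(\sum_{i\,:\,v_i \neq 0} |v_i|\Big)^2
  &\le |v|_0\, |v|_2^2 \;\le\; k
  \qquad \text{(Cauchy-Schwarz)} .
\end{align*}
```

Only the quantity $|v|_1^2$ enters the bound, which is what makes an extension to vectors with small $\ell_1$ norm possible.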

Proposition 7.2. Under $H_0$, we have with probability $1 - \delta$

$$ \lambda^k_{\max}(\hat\Sigma) \le \mathrm{*DP}_k(\hat\Sigma) \le 1 + k\sqrt{\frac{\log(p/\delta)}{n}}. $$

PROOF. It follows from (6.4) that

$$ \lambda^k_{\max}(\hat\Sigma) \le \mathrm{*DP}_k(\hat\Sigma) \le \lambda_{\max}(I_p) + k|N|_\infty, $$

which yields the desired result. ■

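The following theorem follows from Proposition 7.1 and Proposition 7.2. We
omit its proof.

Theorem 7.1. Let $\psi_{\mathrm{adv}}$ be the test defined by

$$ \psi_{\mathrm{adv}}(\hat\Sigma) = \mathbf{1}\Big\{\mathrm{*DP}_k(\hat\Sigma) > 1 + k\sqrt{\frac{\log(p/\delta)}{n}}\Big\}. $$

Then the test $\psi_{\mathrm{adv}}$ discriminates between $H_0$ and $H_1$ with
probability $1 - \delta$ as soon as

$$ \theta > 2k\sqrt{\frac{\log(p/\delta)}{n}}. $$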

The lower bound proved in Section 5 can be extended to encompass the adversarial
setup of this section. The next theorem gives a lower bound on the detection level
that holds for an adversarial noise that is bounded in $|\cdot|_\infty$ norm.
Note that the lower bound in Theorem 7.2 below is not minimax: it exhibits one
model under which no test can discriminate between $H_0$ and $H_1$ with
probability of error less than 1/2. The model is the following. Let
$v = (v_1, \ldots, v_p)^\top \in \mathbb{R}^p$ be such that $v_j = 1/\sqrt{k}$ if
$j \le k$ and $v_j = 0$ otherwise. Define the random matrix $N$ that takes values
$\pm\frac{\theta}{2}vv^\top$, each with probability 1/2.

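Theorem 7.2. There exists an adversarial model of the form (7.1) where
$|N|_\infty \le \sqrt{\log(p)/n}$ almost surely, such that for any test
$\psi(\hat\Sigma) \in \{0, 1\}$ it holds

$$ P_{H_1}(\psi(\hat\Sigma) = 0) \vee P_{H_0}(\psi(\hat\Sigma) = 1) \ge 1/2 $$

as soon as $\theta \le 2k\sqrt{\log(p)/n}$.

Proof. Note first that the matrix $N$ defined above satisfies the assumptions
on the noise since almost surely, we have

$$ |N|_\infty = \frac{\theta}{2k} \le \sqrt{\frac{\log(p)}{n}}. $$

Moreover, it holds that

$$
P_{H_0}\Big(\hat\Sigma = I_p + \frac{\theta}{2}vv^\top\Big) = \frac{1}{2}, \qquad
P_{H_1}\Big(\hat\Sigma = I_p + \frac{\theta}{2}vv^\top\Big) = \frac{1}{2}.
$$

Therefore, if $\psi(I_p + \frac{\theta}{2}vv^\top) = 1$, then
$P_{H_0}(\psi(\hat\Sigma) = 1) \ge 1/2$, and if
$\psi(I_p + \frac{\theta}{2}vv^\top) = 0$, then
$P_{H_1}(\psi(\hat\Sigma) = 0) \ge 1/2$. ■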
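The construction in this proof can be verified exactly on a small instance: the same observed matrix arises under both hypotheses. The sketch below uses only the standard library and exact rational arithmetic, relying on the fact that the entries of $vv^\top$ equal $1/k$ on the leading $k \times k$ block. The helper names (`identity`, `scaled_vvT`, `add`, `sup_norm`) and the values $p = 6$, $k = 3$, $\theta = 1/2$ are arbitrary choices for illustration.

```python
from fractions import Fraction

def identity(p):
    """p x p identity matrix with exact rational entries."""
    return [[Fraction(int(i == j)) for j in range(p)] for i in range(p)]

def scaled_vvT(p, k, scale):
    """scale * v v^T for v_j = 1/sqrt(k) if j < k and 0 otherwise.
    The entries of v v^T are exactly 1/k on the leading k x k block."""
    return [[scale * Fraction(1, k) if (i < k and j < k) else Fraction(0)
             for j in range(p)] for i in range(p)]

def add(A, B):
    """Entrywise sum of two matrices stored as nested lists."""
    return [[a + b for a, b in zip(ra, rb)] for ra, rb in zip(A, B)]

def sup_norm(A):
    """Largest absolute entry, i.e. the |.|_inf norm used in the text."""
    return max(abs(x) for row in A for x in row)

p, k, theta = 6, 3, Fraction(1, 2)

# Under H0: covariance I_p, noise N = +(theta/2) v v^T.
noise_plus = scaled_vvT(p, k, theta / 2)
observed_h0 = add(identity(p), noise_plus)

# Under H1: covariance I_p + theta v v^T, noise N = -(theta/2) v v^T.
noise_minus = scaled_vvT(p, k, -theta / 2)
observed_h1 = add(add(identity(p), scaled_vvT(p, k, theta)), noise_minus)

print(observed_h0 == observed_h1)        # same observation under both hypotheses
print(sup_norm(noise_plus) == theta / (2 * k))
```

Since each sign of the noise has probability 1/2, this common observation occurs with probability 1/2 under each hypothesis, which is all the argument needs.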

Note that unlike Theorem 5.1, Theorem 7.2 gives a lower bound only for tests
that depend on $\hat\Sigma$. Nonetheless, in the adversarial model (7.1), these
are the only tests that make sense since $\hat\Sigma$ is the only observation
available.
7.2 Sparsity in terms of $\ell_1$ norm

Sparsity in terms of $\ell_0$ norm is actually very stringent and hardly occurs in
real datasets. Rather, it may be more realistic to perform the test

$$
H_0: X \sim \mathcal{N}(0, I_p) \qquad \text{vs.} \qquad
H_1: X \sim \mathcal{N}(0, I_p + \theta vv^\top), \quad |v|_2 = 1,
$$

where $|v|_1^2 = \omega$ for some small $\omega > 0$. This allows for vectors
$v \in \mathbb{R}^p$ that have ordered coordinates that decay fast enough but
never take value zero. It is not hard to see that our analysis extends to this
case and the following theorem holds by following the same steps as the proof of
Theorem 6.1. We provide it without proof. Recall that $\tilde\theta$ is defined
in (6.10).

Theorem 7.3. Assume that $p, n, \delta$ and $k = \omega$ are such that
$\tilde\theta \le 1$. Then, for any $\theta > \tilde\theta$ and for any
$\tau \in [\tilde\tau_0, \tilde\tau_1]$, the test
$\mathbf{1}\{\mathrm{*DP}_k(\hat\Sigma) > \tau\}$ discriminates between $H_0$ and
$H_1$ with probability $1 - \delta$.



Computation costs are a crucial element in this study. In Bach et al. (2010),
the SDP relaxation with accuracy $\epsilon$ is shown to have a total complexity of
$O(kp^3\sqrt{\log(p)}/\epsilon)$. This is achieved by minimizing a smooth
approximation of the dual function, using first order methods from Nesterov (2003).
However, this polynomial cost is already prohibitive in a very high-dimensional
setting, and we study only tests based on the $\mathrm{MDP}_k$ statistic. The
purpose of this section is to illustrate the empirical behavior of tests based on
$\mathrm{MDP}_k$ and to compare it with the diagonal method.

8.1 Comparison of different methods 

We simulate $N = 1{,}000$ samples of $n$ independent random vectors
$X_1^0, \ldots, X_n^0 \sim \mathcal{N}(0, I_p)$ and
$X_1^1, \ldots, X_n^1 \sim \mathcal{N}(0, I_p + \theta vv^\top)$, for random unit
vectors $v$ supported on $S = \{1, \ldots, k\}$. The vector $v_S$ is distributed
uniformly on the unit sphere of dimension $k$.

It yields $N$ empirical covariance matrices
$\hat\Sigma_1^0, \ldots, \hat\Sigma_N^0$ under $H_0$ and $N$ of them,
$\hat\Sigma_1^1, \ldots, \hat\Sigma_N^1$, under $H_1$. We compute the $D$ and
$\mathrm{MDP}_k$ statistics for these samples and compare their densities. We take
$\theta = 4$ and observe that the $D$ statistic yields two distributions under
$H_0$ and $H_1$ that are hard to distinguish (Figure 1, left). In particular, it
is clear that the statistic $D$ cannot discriminate between $H_0$ and $H_1$ for
$\theta = 4$, with this set of parameters. However, the distributions of
$\mathrm{MDP}_k(\hat\Sigma)$ under $H_0$ and $H_1$ have almost disjoint support so
that it can discriminate between the two hypotheses with probability close to one.
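A scaled-down version of this comparison can be reproduced in a few lines. The sketch below assumes NumPy and uses much smaller values ($p = 30$, $n = 200$, $k = 5$, 20 replications) than the experiment described above, so its output is only indicative. The MDP statistic is approximated by a grid search over $z$, and all function names are illustrative.

```python
import numpy as np

def soft_threshold(A, z):
    """Entrywise soft-thresholding: sign(A_ij) * max(|A_ij| - z, 0)."""
    return np.sign(A) * np.maximum(np.abs(A) - z, 0.0)

def mdp_statistic(A, k, num_grid=50):
    """Grid approximation of min over z >= 0 of lambda_max(st_z(A)) + k z."""
    grid = np.linspace(0.0, float(np.abs(A).max()), num_grid)
    return float(min(np.linalg.eigvalsh(soft_threshold(A, z))[-1] + k * z
                     for z in grid))

def diagonal_statistic(A):
    """Johnstone's diagonal statistic: the largest diagonal entry."""
    return float(np.max(np.diag(A)))

def sample_covariance(rng, n, p, k, theta):
    """Empirical covariance of n draws from N(0, I_p + theta v v^T), with v
    uniform on the unit sphere of the first k coordinates (theta = 0 is H0)."""
    X = rng.standard_normal((n, p))
    if theta > 0:
        v = np.zeros(p)
        g = rng.standard_normal(k)
        v[:k] = g / np.linalg.norm(g)
        # Rank-one spike: each row gets sqrt(theta) * w_i * v with w_i ~ N(0, 1).
        X += np.sqrt(theta) * rng.standard_normal((n, 1)) * v
    return X.T @ X / n

def simulate(n=200, p=30, k=5, theta=4.0, reps=20, seed=0):
    """Return the D and MDP statistics over `reps` replications per hypothesis."""
    rng = np.random.default_rng(seed)
    out = {}
    for name, th in (("H0", 0.0), ("H1", theta)):
        covs = [sample_covariance(rng, n, p, k, th) for _ in range(reps)]
        out[name] = {
            "D": np.array([diagonal_statistic(S) for S in covs]),
            "MDP": np.array([mdp_statistic(S, k) for S in covs]),
        }
    return out

results = simulate()
for stat in ("D", "MDP"):
    print(stat, results["H0"][stat].mean(), results["H1"][stat].mean())
```

Plotting histograms of the four arrays in `results` would give a rough analogue of the density comparison, at a much smaller scale than the experiment reported here.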

8.2 Tightness of error bounds 

In Section 6, we prove that both the $D$ and $\mathrm{MDP}_k$ statistics
discriminate between $H_0$ and $H_1$ with high probability as long as
$\theta > Ck\sqrt{\log(p/k)/n}$. The previous subsection indicates that
$\mathrm{MDP}_k$ actually performs better than $D$. It is pertinent to wonder if
it performs as well as $\lambda^k_{\max}$. Answering this question would actually
require implementing $\lambda^k_{\max}$, which is impossible even for a moderate
problem size.
[Figure 1: two panels. Left: diagonal statistic $D$. Right: $\mathrm{MDP}_k$ statistic.]

Figure 1. For $p = 500$, $n = 200$, $k = 30$, $N = 1{,}000$, estimated densities
for the two statistics, under $H_0$ (solid line) and under $H_1$ (dashed line).

For $\mathrm{MDP}_k$ to be considered a performant approximation of
$\lambda^k_{\max}$, it would need to discriminate between $H_0$ and $H_1$ with
high probability as soon as $\theta$ is of the order $\sqrt{k\log(p/k)/n}$, which
is the minimax optimal detection level that is also achieved by
$\lambda^k_{\max}$. This behavior can be illustrated by showing a phase transition
for the probability of error in the testing problem, as a function of $\theta$,
for different choices of $(p, n, k)$. More precisely, there should exist a
critical value $\theta_{\mathrm{crit}}$ and a constant $C_{\mathrm{crit}}$ such
that when $\theta > \theta_{\mathrm{crit}} = C_{\mathrm{crit}}\sqrt{k\log(p/k)/n}$,
the probability of type II error is very close to 0. Moreover, as a constant,
$C_{\mathrm{crit}}$ should not depend on $(p, n, k)$.

In order to substantiate such effects, it is actually more pertinent to use a
reciprocal setting. For fixed $\theta_0 = 1$ and several choices of parameters
$p, k$, we exhibit a phase transition for the probability of error in the testing
problem, as a function of

$$ \eta = \eta(n) = \sqrt{\frac{k\log(p/k)}{n}}. $$

In this setting, there should exist a critical value $\eta_{\mathrm{crit}}$ such
that when $\eta < \eta_{\mathrm{crit}} = \theta_0/C_{\mathrm{crit}}$, the
probability of error is very close to 0.
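To design such an experiment one needs to translate a target value of the scaling factor into a sample size. The snippet below is a small sketch using only the standard library, assuming the scaling factor has the form $\eta(n) = \sqrt{k\log(p/k)/n}$ and that $k$ is the integer part of $\sqrt{p}$. The helper names `eta` and `sample_size_for` are hypothetical.

```python
import math

def eta(n, p, k):
    """Scaling factor eta(n) = sqrt(k * log(p / k) / n)."""
    return math.sqrt(k * math.log(p / k) / n)

def sample_size_for(eta_target, p, k):
    """Smallest integer n such that eta(n) <= eta_target."""
    return math.ceil(k * math.log(p / k) / eta_target**2)

p = 500
k = math.isqrt(p)                 # integer part of sqrt(p)
n = sample_size_for(0.1, p, k)
print(k, n, eta(n, p, k))
```

Sweeping `eta_target` over a grid and recording the resulting `n` for each `p` gives the horizontal axis along which curves for different $(p, n, k)$ can be compared.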

To achieve this goal, we simulate $N = 1{,}400$ samples of $n$ independent random
variables $X_1^0, \ldots, X_n^0 \sim \mathcal{N}(0, I_p)$. It yields
$\hat\Sigma_1^0, \ldots, \hat\Sigma_N^0$ that are drawn under $H_0$, and used to
estimate the quantiles $q_{.01}, q_{.05}$ at 1% and 5% for the $\mathrm{MDP}_k$
statistic. The same process is repeated under $H_1$ to estimate the probability of
type II error $P_{H_1}(\mathrm{MDP}_k(\hat\Sigma) \le q_\alpha)$. To that end, we
simulate $X_1^1, \ldots, X_n^1 \sim \mathcal{N}(0, I_p + \theta vv^\top)$, for
random unit vectors $v$ supported on $S = \{1, \ldots, k\}$. The restriction of
$v$ to $S$ is distributed uniformly on the unit sphere of dimension $k$. To display
a one-dimensional dependence, $k$ is chosen equal to the integer part of $\sqrt{p}$.
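The protocol above can be sketched as a Monte Carlo routine. The code below is an illustration under several assumptions: it uses NumPy, takes $q_\alpha$ to be the empirical $(1-\alpha)$ quantile of the statistic under $H_0$ (so that the test has level about $\alpha$), approximates $\mathrm{MDP}_k$ by a grid search, and uses far fewer replications and a smaller dimension than described in the text. Function names are illustrative.

```python
import numpy as np

def soft_threshold(A, z):
    """Entrywise soft-thresholding: sign(A_ij) * max(|A_ij| - z, 0)."""
    return np.sign(A) * np.maximum(np.abs(A) - z, 0.0)

def mdp_statistic(A, k, num_grid=50):
    """Grid approximation of min over z >= 0 of lambda_max(st_z(A)) + k z."""
    grid = np.linspace(0.0, float(np.abs(A).max()), num_grid)
    return float(min(np.linalg.eigvalsh(soft_threshold(A, z))[-1] + k * z
                     for z in grid))

def sample_covariance(rng, n, p, k, theta):
    """Empirical covariance of n draws from N(0, I_p + theta v v^T), with v
    uniform on the unit sphere of the first k coordinates (theta = 0 is H0)."""
    X = rng.standard_normal((n, p))
    if theta > 0:
        v = np.zeros(p)
        g = rng.standard_normal(k)
        v[:k] = g / np.linalg.norm(g)
        X += np.sqrt(theta) * rng.standard_normal((n, 1)) * v
    return X.T @ X / n

def type_two_error(n, p, k, theta, alpha=0.05, reps=30, seed=0):
    """Monte Carlo estimate of P_H1(MDP_k <= q_alpha), where q_alpha is the
    empirical (1 - alpha) quantile of MDP_k under H0."""
    rng = np.random.default_rng(seed)
    null = np.array([mdp_statistic(sample_covariance(rng, n, p, k, 0.0), k)
                     for _ in range(reps)])
    alt = np.array([mdp_statistic(sample_covariance(rng, n, p, k, theta), k)
                    for _ in range(reps)])
    q_alpha = np.quantile(null, 1 - alpha)
    return float(np.mean(alt <= q_alpha))

p = 20
k = int(np.sqrt(p))               # integer part of sqrt(p)
err = type_two_error(n=1000, p=p, k=k, theta=1.0)
print(err)
```

Repeating the call over a range of `n` and plotting the estimated error against $\eta(n)$ would trace a curve of the kind discussed in this subsection, at a much coarser resolution.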

Figure 2 illustrates a phase transition for the probability of testing error, at a
critical level $\eta_{\mathrm{crit}} \approx 0.1$ independent of $(p, n, k)$. The
concomitance of these curves for different choices of $(p, n, k)$ indicates that
$\eta$ is the correct scaling factor for the $\mathrm{MDP}_k$ statistic. This
suggests that the upper bound for convex detection that we prove in (6.10) is
pessimistic and that $\mathrm{MDP}_k(\hat\Sigma)$ is an even better proxy for
$\lambda^k_{\max}(\hat\Sigma)$ than predicted by our theory.
[Figure 2: two panels; horizontal axis in both panels: scaling factor $\eta(n)$.]

Figure 2. Type II errors as a function of $\eta(n)$ for
$p \in \{50, 100, 200, 500\}$, $k = \lfloor\sqrt{p}\rfloor$, $N = 1{,}400$.
Left: $\alpha = 5\%$, right: $\alpha = 1\%$.